AWS Big Data Services

In this module, we'll take a look at each of the major big data services on the Amazon Cloud. We'll begin with Simple Storage Service (S3) and Amazon Athena, which together constitute the core data lake offering from AWS. We'll then take a look at Elastic MapReduce, which performs batch analytics on top of data that sits in the lake. We'll move on to Redshift, which is Amazon's cloud data warehouse. We'll talk about Glue and Data Pipeline, which together offer data preparation and data integration services. We'll then discuss DynamoDB and some of the other NoSQL offerings on the Amazon Cloud. And we'll finish up with a discussion of the Relational Database Service, or RDS, which handles core transactional database workloads. With that sort of table of contents out of the way, let's look at S3, the Simple Storage Service. It's where cloud applications of all kinds will store their files. But in addition to being a general file store, it's also the storage layer for Amazon's data lake offering, and that's because data stored in open data file formats, when put in an Amazon S3 bucket or a series of those buckets, constitutes a data lake, just like those same types of files stored in other file systems like the Hadoop Distributed File System on premises, for example. Speaking of the Hadoop Distributed File System, S3 has a very special relationship with Elastic MapReduce, which is Amazon's core Hadoop offering. EMR can see data stored inside of an S3 bucket the same way as vanilla Hadoop would see data stored in its own file system. Now, in addition to these data lake services, it turns out virtually all AWS data services connect to S3. S3 is really the low-level integration layer across almost every AWS data service. There is one important exception, and we'll get to that later in the module. And vocabulary-wise, just keep in mind that S3's unit of deployment is a bucket, and within buckets, you can store files and folders, which in turn may store more files and folders. Next, let's talk about Athena.
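Before we do, here's a minimal sketch of that bucket-and-file vocabulary in code, using boto3. The bucket name, key names, and file are all hypothetical, and the snippet assumes your AWS credentials are already configured; treat it as an illustration of the concepts rather than anything specific to the services in this course.

```python
import boto3

# Hypothetical names throughout: create a bucket, upload a file under a
# "folder" prefix, and list what's there.
s3 = boto3.client("s3", region_name="us-east-1")

s3.create_bucket(Bucket="my-data-lake")
s3.upload_file("trips.parquet", "my-data-lake", "raw/trips/trips.parquet")

# "Folders" in S3 are really just key prefixes on the objects.
listing = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/trips/")
for obj in listing.get("Contents", []):
    print(obj["Key"], obj["Size"])
```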

Athena
Athena is almost a companion service to S3 when S3 is used as a data lake, and the reason for that is Athena can provide a direct SQL query layer over data stored in S3. It's based on a couple of open-source data lake technologies called Presto and Apache Hive, and in fact, both of those are available in Elastic MapReduce as well. S3 coupled with Athena, on the other hand, gives you a kind of minimalist data lake analytics suite, trading the vast array of options that you get on Elastic MapReduce for some straightforward simplicity.

  • Another key difference for Athena is that its pricing is consumption based. You pay only for the queries you run. That contrasts quite a bit with Elastic MapReduce, for which you're billed as long as the cluster is running, whether or not you're running queries on that cluster. A short Athena query sketch follows at the end of this list.
  • And because that consumption-based pricing is based on the amount of physical data actually scanned, Athena works great with columnar file formats like Parquet and ORC, because those formats use data compression to store a large amount of data in a relatively small amount of space.
  • Elastic MapReduce really is the big daddy of big data services on AWS. It's one of the oldest services; it's one of the most comprehensive services. It's based primarily on Apache Hadoop and Apache Spark, which are open-source big data execution platforms, but it offers lots of other options for other open-source technologies that can be deployed on its clusters. It is, like many other Amazon data services, tightly integrated with S3.
  • But in the case of Elastic MapReduce, there's a special integration called the Elastic MapReduce file system, which makes S3 appear to Elastic MapReduce as if it were the native Hadoop file system. 
  • Now because of its basis on Apache Hadoop and Spark, EMR primarily handles big data and data lake analytics workloads, but it handles other ones as well. For example, it can handle streaming data workloads, data integration tasks, it can even take on machine learning and artificial intelligence. And because EMR has been around for so long and established itself as a standard, there's an impressive third-party ecosystem built around it.
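Here's the Athena sketch promised above: a hedged boto3 example of running a SQL query over (hypothetical) Parquet data sitting in S3. Athena runs queries asynchronously, so the code kicks off the query, then checks its status; results land in an S3 output location that you designate. The database, table, and bucket names are made up.

```python
import boto3

# Run an Athena query over data in S3 and direct the results to another
# S3 location. All names here are illustrative, not real resources.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT medallion_type, COUNT(*) AS vehicles "
                "FROM taxi_db.medallions GROUP BY medallion_type",
    QueryExecutionContext={"Database": "taxi_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Athena is asynchronous; poll the execution until it succeeds or fails.
query_id = response["QueryExecutionId"]
status = athena.get_query_execution(QueryExecutionId=query_id)
print(status["QueryExecution"]["Status"]["State"])
```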
Redshift, Glue

Now let's move on to data warehousing, and let's recognize that Amazon Redshift really established the category of cloud data warehouses, and with it, it really breathed new life into the data warehouse market overall. It was AWS's fastest growing data service for years, and it's no wonder, because Redshift introduced the notion of elastic scalability to the data warehousing world. On the other hand, in the case of Redshift, the scalability for compute and storage must go in lockstep, and that's because Redshift does not use S3 for its storage layer, but rather uses the SSD drives on the nodes in the cluster itself. As a result, those clusters need to run 24/7 with the associated costs involved in that continuous operation. As with EMR, Redshift established itself early as a standard, and as a result, has huge ecosystem support. And although discussion of Amazon competitors is largely beyond the scope of this course, I'll mention one particular competitor, which is Snowflake. Snowflake can run on AWS, and it does use S3 for its storage layer, allowing compute and storage to scale separately. The fact that Snowflake and some other competitors work that way has caused some competitive friction for Redshift. Now let's talk about Amazon Glue, which really handles two different missions. On the one hand, it provides a data catalog, and that data catalog component is used by other Amazon data services. In fact, we already mentioned that Amazon Athena leverages the data catalog component of Glue. But Glue also provides a data integration/data prep platform. In the Amazon Glue world, so-called crawlers will take a look at existing data sources and scan those data sources for metadata information about the data within them. As the crawlers do their work, they add tables to the Glue data catalog, and once those tables are in place, data integration or data prep jobs can be run against those tables. And as we discuss each of these services, we also mention the integrations between them and others. In the case of Glue, there is tight integration once again with S3, as well as DynamoDB, Redshift, RDS, and even external databases. And the last three of those are made possible by a technology called JDBC, for Java Database Connectivity. When you design data integration and data prep jobs, you get a visual interface to get the work done. But as it turns out, that visual interface is actually generating code that runs on Spark. Now, the Spark cluster the code runs on is not an EMR cluster; it's a Spark cluster that Amazon manages for you, and as such, it's able to provide customers with consumption-based pricing. The code that it generates uses a high-level API specific to Amazon Glue, and that code is editable to a point. But the thing to keep in mind is that the visual interface is for initial authoring and not for editing of the jobs. Now we've already mentioned that the data catalog component of Amazon Glue is utilized by other services, but that understates a point, which is that this data catalog subcomponent is strategic to Amazon in its approach to data lake architecture and the data lake market.
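To make that crawler-and-catalog workflow a bit more concrete, here's a minimal boto3 sketch. The crawler name, IAM role, database, and S3 path are all hypothetical; the point is simply that a crawler scans a location, registers tables in the Glue Data Catalog, and those tables then become visible to Athena, Redshift Spectrum, EMR, and Glue jobs.

```python
import boto3

# A crawler that scans an S3 prefix and registers whatever tables it finds
# in the Glue Data Catalog. Names and ARNs below are illustrative.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
)
glue.start_crawler(Name="sales-data-crawler")

# Once the crawler finishes, the discovered tables show up in the catalog.
for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
    print(table["Name"])
```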
Lake Formation, Data Pipeline, NoSQL Services, RDS and Aurora

And speaking of Glue, Amazon has a relatively new offering called Lake Formation that is there to accelerate the process of creating your data lake in the first place. It automates things like setting up of security and access controls, partitioning of the data files, as well as deduplicating, cleansing, and classifying the data. And as it turns out, Lake Formation builds on and heavily leverages Glue, so you can really think of it as an acceleration layer that helps customers implement Amazon Glue and get their data lakes up and running. Moving on from Glue and its associated service, Lake Formation, we'll talk about Data Pipeline, and while Data Pipeline doesn't offer a data catalog, it does take on some of the data integration and data movement tasks that Glue can take on as well. It takes its own very visual approach to doing this though. In effect, Data Pipeline users build out data flows or diagrams that contain various boxes and lines. The boxes represent different repositories and engines, and each one of them has a number of properties, and very specific values need to be filled in for those properties. There's a great integration story here too. Data Pipeline interfaces with S3, Redshift, and DynamoDB. The pipelines can be run on demand or run on a scheduled basis, and it's important to remember that although it is a 100% visual environment, and it's code free because of that, it is still an advanced service and not one that is oriented towards beginners. Now NoSQL databases are special data repositories that can store the data without a strict schema having to be established first, and there are four major kinds of NoSQL databases. The most basic of those is a key-value store, and DynamoDB is Amazon's NoSQL key-value store service. If you're interested in some of the other types of NoSQL databases, you may be interested in DocumentDB, which implements a NoSQL document store, Neptune, which implements a graph database on the Amazon Cloud, and if you spin up an Elastic MapReduce cluster, you can optionally include Apache HBase, which will implement a column family store NoSQL database. Together, those are the four major NoSQL categories, and Amazon has an offering for each one of them. It actually has a second key-value store service called SimpleDB. This predates the offering of DynamoDB, and Amazon's guidance now for new projects is to use DynamoDB. Effectively, SimpleDB has been deprecated, but it is still available on the Amazon Cloud. Now although we're talking about big data services in this course and most of those are geared toward analytical workloads, it's also important to understand that Amazon has a number of offerings in the traditional online transactional processing realm of relational databases, and the core offering there is something called Relational Database Service. You can really view RDS as a family of services, because under the RDS umbrella, you can deploy the major commercial databases, Oracle and SQL Server, as well as open-source databases like MySQL, MariaDB, and PostgreSQL, also known as simply Postgres. And Amazon also offers a sort of house brand relational database called Aurora. It is a cloud-native database, designed by Amazon, runs in a serverless fashion, and can run in modes compatible with both MySQL and Postgres.
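Before we map all of these services back together, here's a quick sketch of the key-value access pattern DynamoDB is built around. The table, keys, and attributes are hypothetical, and the table is assumed to already exist with CustomerId as its partition key; notice that, NoSQL-style, the two items don't need to share the same set of attributes.

```python
import boto3

# A minimal key-value sketch against a hypothetical DynamoDB table.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("Customers")

# Writes don't require a predefined schema beyond the key attributes.
table.put_item(Item={"CustomerId": "1001", "LastName": "Brust", "MiddleName": "J"})
table.put_item(Item={"CustomerId": "1002", "LastName": "Howell", "Suffix": "III"})

# Reads by key are the core key-value access pattern.
item = table.get_item(Key={"CustomerId": "1001"}).get("Item")
print(item)
```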
Mapping the Services, Demo Lead-in

Now we've talked about a number of individual services. Let's tie them back together by listing them out and understanding in which particular big data technology category each one falls. You'll recall that Redshift is Amazon's cloud data warehouse service, and that S3 and Athena together form the core data lake offering. Elastic MapReduce is the batch analytics service that can operate on the data in the data lake. RDS is the core relational database service, or operational database family of services. DynamoDB is the headline NoSQL database service from Amazon, even though we saw that there are other NoSQL offerings in there as well. Now Kinesis and MSK are services we're going to talk about later in the course, but we'll list them here for completeness and note that they are services dedicated to handling streaming data workloads. Glue and Data Pipeline are the data integration and data preparation services, and we also saw how Glue implements a strategically important data catalog subcomponent. And another service we'll talk about later on in the course is Amazon SageMaker, which is Amazon's poster child for artificial intelligence and machine learning.
Demo: EMR Provisioning

Now with that summary behind us, we can pause for a quick demo. This is not a demo-heavy course, but some context is still really helpful here. So I'll show you how we can provision an Elastic MapReduce cluster on AWS, and you'll see that there is a vast menu of various open-source software analytics components that we can choose from as we perform that deployment. Let's take a look. Let's now take a look at the EMR cluster creation process. When you come to the EMR page in the Amazon Web Services console, you'll see a list of running clusters if you have any, and if not, you'll see this welcome screen and this blue button that allows you to create a cluster. We'll go ahead and push that and see what happens next. By default, you'll come into the Quick Options mode of the Create Cluster page. And here you'll be able to choose from one of four cluster configurations, one each for Core Hadoop, HBase, Presto, and Spark. You'll also notice an option to use the AWS Glue Data Catalog as the metastore for Hive or for whichever SQL engine component is included in the cluster type you've selected. Scrolling down, you can specify an instance type for the nodes in your cluster and optionally pick a key pair for secure connections to the command line interface on the head node in the cluster. Once you've made these selections, you can click Create cluster, and your cluster will be created for you. And that is the Quick Options experience. But if we scroll back up and click on the Go to advanced options link, we'll see options for various open-source analytics components. By default, Hadoop, Hive, Hue, and Pig are selected, but we could select any of these components on an a la carte basis. Note if you pick more than one component with a SQL engine in it, and here I'll pick Spark and Presto, that we have options for each of those engines to use the Glue Data Catalog or not. In the Advanced Options mode, you'll need to go through three more screens before creating your cluster, one each for hardware options, cluster settings, and security. When you're done selecting options on that fourth screen, you'll see a Create cluster button there as well.
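For those who prefer scripting to the console walkthrough above, roughly the same provisioning can be done with boto3's run_job_flow call. This is a hedged sketch, not a recommendation: the release label, instance types, IAM roles, and the Glue-catalog configuration shown are illustrative values that you'd adjust for your own account and region.

```python
import boto3

# Create a small EMR cluster with Spark, Hive, and Hue, and point Hive's
# metastore at the Glue Data Catalog. All values are illustrative.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}, {"Name": "Hue"}],
    Configurations=[{
        "Classification": "hive-site",
        "Properties": {
            "hive.metastore.client.factory.class":
                "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
        },
    }],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])   # the new cluster's identifier
```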
Batch Analytics with Elastic MapReduce (EMR)
Overview, MapReduce Diagrammed, Open Source Analytics, EMR Optional Components

I'm Andrew Brust, and this module is Batch Analytics with Elastic MapReduce. In this module, we'll take a look at the big daddy of Amazon big data technologies. We'll get an understanding for the subcomponents within it and how it integrates with other services and fits into the larger Amazon data lake story. We'll finish with a demo of the user experience and the tooling that brings a lot of these subcomponents together. Before we look at Elastic MapReduce itself, let's get an understanding for the MapReduce algorithm from which it draws its name and on which Hadoop was originally based. This algorithm is no longer used all the time in big data, but it still conveys the overall approach to taking large datasets and doing analytics on them. We start with the basic premise of having an arbitrarily large input dataset and splitting it into a number of smaller input files. Now don't take the fact that there are six such files shown on this slide literally. In practice, we could have orders of magnitude more. But each one of these input files gets sent to a dedicated node in our EMR cluster, and each of those nodes runs a piece of code called the mapper function. The mapper function parses the input data and outputs it into a transformed format of key and value pairs. Those output files then go through a process called shuffling, where the data is combined and collated by key. Each of these shuffle files then act as inputs that get sent to nodes in the cluster as well. Those nodes run a reduce function that takes all the values for a given key and outputs exactly one value. Oftentimes, that one value is simply a sum of all the constituent values, thereby making the reducer function a simple aggregation. But in point of fact, any logic can be executed there, as long as exactly one value is output for each key. The outputs from each reducer node can then be combined to form the output for the entire MapReduce job. Now again, this algorithm is not used in every case in big data processing, but this basic notion of taking a large input dataset, splitting it up into a bunch of smaller input files, and processing each of those separately and simultaneously is typically what big data is all about. Now let's talk about a basic tenet of open-source analytics technology. All of the components that tend to get used in that arena and with Elastic MapReduce are defined by the ecosystem around Apache Hadoop. Some of the more prominent components in that list include Hadoop itself and Apache Tez; Hive, which is a SQL query engine over data in Hadoop or just in S3; HBase, which as we mentioned in the previous module is a NoSQL column family database over that data; Spark, which is an alternative to Hadoop for processing big data; and Presto, which is an open-source SQL query engine that can connect to big data in Hadoop or a data lake, but to other sources as well. The ability to use so many different engines derives from the fact that big data and data lake technologies are based on the notion that the data is stored in relatively neutral formats so that a collection of different engines can be brought in combination to process and query that data. To make this a little more concrete, here's a screenshot showing all of the different components that can be deployed on an EMR cluster. The components with blue checkmarks are the ones that are included by default, but there are many others in this list that are important. 
To begin with, there's Apache Spark, which is used more and more frequently in big data analytics scenarios today, and the code against Spark is typically stored in and executed from so-called notebooks that can run either on the Jupyter platform or a similar platform called Zeppelin; Tez and HBase, which we've already mentioned, are key Hadoop components; Presto, which is another SQL query engine that can be used against data in Hadoop; Flink, which is another Apache project focused on streaming data; and TensorFlow, which is a very popular library and framework for doing deep-learning artificial intelligence work.
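Before moving on, here's a toy, single-process Python illustration of the map, shuffle, and reduce phases described earlier, using the classic word-count example. A real MapReduce job runs these same three phases in parallel across the nodes of a cluster; this sketch just makes the data flow visible.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Emit (key, value) pairs: one (word, 1) per word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped_pairs):
    # Collate all values by key, as the shuffle phase does.
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Emit exactly one output value per key; here, a simple sum.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(mapper(line) for line in lines)
results = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(results)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```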
Hue, SQL Options, Columnar Files, EMR Connections, Data Lake Stack, Demo Lead-in

Now that array of components can be a little bit overwhelming, but there is a tooling experience that brings many of them together. That's another open-source project called Hue. Hue provides a number of editors with syntax completion and simple charting that allow you to do queries against Hive and run scripts against Apache Pig. There are various browsers for looking at the data that's in the Hive metastore, as well as data files stored in HDFS or in S3. And there's a file viewer for looking at any of those files in those storage layers that might be human readable. Finally, Hue has its own notebooks facility, distinct from those notebooks inside of Jupyter and Zeppelin. These are less commonly used. But having all of these under one user interface umbrella can be very convenient indeed. Now we've talked about a number of different technologies that can be used to query data in Elastic MapReduce with SQL, but let's try and get a full inventory here on this slide. So within Elastic MapReduce itself, there's Hive, there's the Spark SQL subcomponent of Apache Spark, there's Presto, and an additional component called Phoenix, which was shown in the screenshot of all the different subcomponents, but it's not that frequently used. Outside of Elastic MapReduce, don't forget, as we've discussed in the previous module, Amazon Athena can be used to query the data in S3 that may have been produced or updated or otherwise modified by Elastic MapReduce, and Athena is also based on Presto. Now the ability to bring many engines to the same data is based on the premise that the data is stored in relatively open and agnostic file formats. Two of those formats, namely Parquet and ORC, each of which is an Apache Software Foundation project in its own right, are especially well suited to analytics workloads. The reason for that is that these file formats store all the data for a given data column contiguously. That makes aggregating all of those values especially efficient, but it also brings about economic benefits as well, since data stored in that format can be highly compressed. This leads to smaller files and therefore lower storage costs. But as we mentioned in the previous module, services like Amazon Athena, which are billed based on the amount of physical data actually scanned, will yield lower charges as well when used with compressed file formats. In general, these two formats, Parquet and ORC, are core to data lake implementations, both on-premises and across the different public clouds. Now let's get a sense for how EMR connects with other Amazon services. We've already said a number of times how EMR uses S3 as its persistent storage layer, and EMR also has direct integrations with the operational databases in Relational Database Service, with Amazon's data warehouse, Redshift, and with Amazon's banner NoSQL technology, DynamoDB. Let's go a step further though and understand how all of these different subcomponents interact and combine to form a full data lake story. We start with S3 and the file formats themselves, including the Parquet and ORC formats we just discussed, an additional format called Apache Avro, and the decades-old flat file storage format CSV, which stands for comma-separated values. This is all on the storage layer of course, so let's talk about what happens at the processing layer. We begin with Elastic MapReduce and specifically its Hive and Spark components. And by Spark, we mean both core Spark and Spark SQL.
Now these components can both read from and write to the data stored in S3, and in addition, these components on the right have the ability to read from that data. On the write-only side, we have the streaming data services of Amazon Kinesis and MSK.
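Here's a minimal PySpark sketch of that read-and-write relationship with S3, as it might look on an EMR cluster where EMRFS makes s3:// paths available. The bucket and paths are hypothetical; the point is the pattern of landing raw files, curating them into compressed columnar Parquet, and then querying the curated copy.

```python
from pyspark.sql import SparkSession

# Read from and write to S3 via EMRFS. Bucket names and paths are made up.
spark = SparkSession.builder.appName("s3-parquet-demo").getOrCreate()

# Read raw CSV from the lake's landing zone...
trips = spark.read.csv("s3://my-data-lake/raw/trips/", header=True, inferSchema=True)

# ...and write it back out as compressed, columnar Parquet for analytics.
trips.write.mode("overwrite").parquet("s3://my-data-lake/curated/trips/")

# Spark SQL (and other engines pointed at the same files) can then query it.
spark.read.parquet("s3://my-data-lake/curated/trips/").createOrReplaceTempView("trips")
spark.sql("SELECT COUNT(*) FROM trips").show()
```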
Demo: Hue and Its Multiple Editors

Now all of those components can be a little overwhelming, so let's do a quick demo where we can see Apache Hue and how it ties a number of these components together in a single user interface experience. We'll see how Hue can help manage MapReduce jobs, facilitate the editing and execution of scripts against Apache Pig, and also accommodate queries against Apache HBase, data stored in Hive, and that same data queried with Spark SQL. We'll then take a look at querying Hive from the command line, just to contrast it with the user interface experience. Let's take a look. And here we are in Apache Hue, which provides user interfaces for most of the components inside of Elastic MapReduce. To begin with, we're looking at a user interface for running MapReduce jobs, and in this particular case, we're looking at running word count, which is a very common Hadoop sample MapReduce job. Moving on, we also have a browser for Apache HBase, and in this particular case, we're looking at a table called customers that I created previously. Notice that because HBase is a NoSQL database, we can have differing schema from row to row. Notice that for our first row of data, we have a column that is unique to that row called MiddleName, and for the second row, we have a unique column called Suffix. In addition to simply browsing the data in the table without any particular sophistication, we could also run very specific queries. In this case, we'll be looking at the FirstName column for the row with Brust as the key, and the LastName column for the row with Howell as the key. If I hit Enter, we'll get back just that particular data. Next, we'll look at the user interface for Apache Pig, where we can run entire Pig Latin scripts. This very simple script will take the contents of two different files that we've uploaded to HDFS, specifically midsummer_freq and sonnets_freq. We'll join the contents of those two files by the word column, and create a third file that will be output into a folder called willie_shakes_joined. Let's go ahead and execute this script, and with execution complete, let's refresh our view of HDFS. Note that we have a new folder called willie_shakes_joined. And if we open that folder, we see the output of our job in this particular sequence file, and we can actually browse the contents of the file right here in Hue. Let's move on now to our user interface for Apache Hive, and this particular interface, by the way, is probably the one most commonly used inside of Hue. In this particular case, we're going to create an empty table for some employee data. Let's go ahead and run that. And with that query complete, we can run a second query that will load data into the table, specifically by using the data in a tab-delimited file that we had already uploaded. That file is called sample data.tsv, and we can see it in the HDFS browser as well. We'll go ahead and run this query. And with that table creation and load now complete, we have one more query we could take a look at, which will select all the data out of that table. Rather than run it here in the Hive user interface though, let's move over to the Spark SQL interface, which is almost identical. And Spark SQL can see the same data that Hive can see because they both use the same metastore. If we refresh the database browser here to see the different tables in our default database, we see the employee table as present. 
It has the columns we defined in our create table script, and if we run this particular query, we should be able to see the contents of the table, and indeed, those are displayed towards the bottom of the screen. And just so we can fully appreciate the power of Hue, note that our alternative would be to work at the command line by opening up a so-called SSH session to the head node in our cluster. In this particular case, we've opened up the command line interface for Hive, and we have the same query here, select * from employee. We'll put a semicolon at the end and hit Enter. And you see we get the same results here. Not nearly the user friendliness, no particular assistance as we do our work, no help. Everything's very bare bones. It's good to know we have this as an option, especially since it is a fairly constant and consistent experience across different versions of Hadoop. However, most of us would rather work in an environment such as this one.
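As a quick footnote to that demo, the shared-metastore point is easy to see from code as well. This hedged PySpark snippet assumes a table named employee already exists in the default database, as it did in the demo; run from a Spark session with Hive support, the same table that the Hive editor created is immediately queryable.

```python
from pyspark.sql import SparkSession

# Spark SQL sees the same catalog as Hive when Hive support is enabled.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

print(spark.catalog.listTables("default"))        # the employee table shows up here
spark.sql("SELECT * FROM default.employee").show()
```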
AI and Streaming Data Processing with EMR
Overview, Streaming Data Options

I'm Andrew Brust, and this module is AI and Streaming Data on AWS. In this module, we'll take a look at the numerous options for doing artificial intelligence work and working with streaming data and streaming data processing on the Amazon Cloud. As you'll see, this involves a combination of standalone Amazon services as well as subcomponents in Elastic MapReduce. We'll take a look at both so you can understand each of them in their own proper context and also get a sense for how they interact. Let's start with streaming data, and there are as many options for working with streaming data on the Amazon Cloud as there are in the greater open-source software world. To begin with, we'll talk about Apache Kafka, which is perhaps the most popular open-source framework for setting up data streams and processing the data from those streams. The Amazon service Managed Streaming for Apache Kafka, or MSK for short, is a managed service from Amazon for deployment and operation of Apache Kafka clusters. Those clusters can then be integrated with other streaming data components and other data lake components as appropriate. Next is the Spark Streaming subcomponent of Apache Spark, itself an optional subcomponent on Elastic MapReduce clusters. Spark Streaming is very popular because it integrates with the other workloads Spark can handle, including big data processing, data engineering, big data analytics with both core Spark and Spark SQL, and indeed machine learning and AI work that can be done on Spark as well. Apache Storm is another popular but older open-source framework for working with streaming data. There is no dedicated Amazon service for Apache Storm, nor is it available as an optional subcomponent on Elastic MapReduce. Apache Storm clusters can nonetheless be deployed to the Amazon Cloud by using the Elastic Compute Cloud, or EC2, infrastructure-as-a-service layer. Best practices for deploying Storm to EC2 are well documented and widely available. Next is Flink, another Apache open-source software project for working with streaming data. Like Spark, it is available as an optional subcomponent on Elastic MapReduce clusters and can be used very productively in conjunction with the other components on the Amazon data lake stack. Finally, there's Kinesis, which is Amazon's own proprietary and native service for streaming data, and that too can be integrated with the various components and services that make up the Amazon data lake stack.
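To give just one of these options a concrete shape, here's a minimal producer sketch against Kinesis Data Streams using boto3. The stream name and payload are hypothetical; on the consuming side, the same stream could feed Spark or Flink on EMR, among other options.

```python
import boto3
import datetime
import json

# Put a single JSON event onto a hypothetical Kinesis stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"sensor_id": "device-42", "reading": 71.3,
         "ts": datetime.datetime.utcnow().isoformat()}

kinesis.put_record(
    StreamName="sensor-readings",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["sensor_id"],   # determines which shard receives the record
)
```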
AI: Machine Learning and Deep Learning, AI on EMR, Demo Lead-in

Moving on to artificial intelligence, machine learning, and deep learning, here again we have a number of options, some of them as dedicated services, others as subcomponents of Elastic MapReduce. We begin with the first service, which is Amazon Machine Learning. This is a fairly simple service for doing machine learning work, specifically for uploading datasets, cleansing them, training machine learning models, and then running predictions against them. This was a version 1.0 attempt at doing machine learning work on the Amazon Cloud, and its limited scope appeals to some because it's more easily mastered than some other options. Having said that, the newer Amazon service, SageMaker, largely supersedes Amazon Machine Learning. It does everything that AML does and more and is suitable for working up datasets, training models at high data volumes, hosting those models in production, and even deploying those models to edge devices in Internet of Things, or IoT, scenarios. While Amazon Machine Learning and SageMaker are really general-purpose data science services, Amazon also has more easily consumed AI services available to developers of all stripes. These tend to focus on use cases like upselling recommendations, image processing and recognition, language processing and translation, and so forth. And then in addition to all of these dedicated Amazon services lie a number of subcomponents in Elastic MapReduce. Let's take a look at those. To begin with, let's take a look at Spark Machine Learning, or Spark MLlib. This is a subcomponent of Spark dedicated to doing machine learning work, and, much like Spark Streaming, it works in conjunction with the other Spark components for handling mixed big data workloads. Next is Apache MXNet, available as an optional subcomponent on EMR clusters. This is an open-source project in the Apache Software Foundation's Incubator for doing deep learning, which is a sophisticated subspecialty within machine learning. TensorFlow is perhaps the more mainstream framework and ecosystem for doing deep learning work. It was created at Google and released as open-source software to the industry at large. Like MXNet and Spark itself, it is available as an optional subcomponent on Elastic MapReduce clusters. Next is Apache Mahout, which is an older open-source framework, very popular in the early days of Hadoop and now actually based on Apache Spark. Despite that modernization and Mahout's facilitation of creating algorithms in addition to models, Mahout has seen reduced adoption in recent years. Finally, there are JupyterLab and Zeppelin, which, as we said in a previous module, implement coding notebooks, which are the most popular way for doing interactive machine learning work in a variety of programming languages, including Python. Notebooks are a self-contained environment for writing code, executing it, visualizing data, and documenting all of the code and visualizations that are contained in the notebook.
Demo: Machine Learning and Jupyter Notebooks

Rather than talking about notebooks in the abstract though, let's now take a look at a demo of doing machine learning work, specifically against Apache Spark from Jupyter Notebooks running on an EMR cluster. Now in addition to the Apache Hue user interfaces that we saw previously, we can also run the Jupyter notebook environment on Elastic MapReduce. Here we're at the home page for Jupyter, looking at the various assets in our root folder, and we see that the first thing listed is a Jupyter notebook called Census Income Logistic Regression. Let's take a look at that notebook. Here we are in the notebook environment itself, and what we'll be doing in this notebook is using a common sample dataset with US Census data that we can use to build a model that predicts whether a particular census subject's income is less than or equal to $50,000 on the one hand or greater than $50,000 annually on the other. We'll be using the Logistic Regression algorithm to build our model. And our first step is simply to load the census income data from a CSV file inside of S3 and to take a look at the first few rows of data in that file, which we see right here. Our next step is to partition that data into sections for training our model and for testing the model once it's been trained. And there's various code in this notebook to save those partitioned versions of the data out to Parquet files and then reload them into the notebook environment. An additional task is to perform a process called indexing, where we take the contents of columns that contain discrete text values, and we create corresponding columns that contain corresponding numerical values. We do this because various machine learning algorithms require numeric data for their inputs. Our next step is to train the model and then run our test data through the model to determine the accuracy of the model that we just trained. There are a few ways of doing this. One is through simple visual inspection, comparing the predicted value to the actual value in our test result set, and we could continue scrolling through the data past these first four rows that are displayed to get a sort of anecdotal sense of the accuracy. A more precise way, however, is to run a query that determines the number of rows where the label and prediction are the same as a ratio to the number of rows in total. And if we run that query, we see here in the notebook that our accuracy is above 82%. To take a more formal data science approach to this, we could plot what's called an ROC curve, and we can actually do that right inside the notebook as we've done here. The dotted diagonal line indicates the accuracy of a simple random guess. The blue line plots the true positive rate of our model against the false positive rate. And the area between the dotted line and the plotted blue line indicates the accuracy of our model which, in this particular case, is 87%. As we saw in this module, Amazon Web Services provides a number of platforms, tools, and options for conducting machine learning work. However, knowing that we can do all of this work in one simple notebook using Apache Spark and its machine learning facilities means that we have a very convenient option for performing machine learning work inside of Elastic MapReduce itself.
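Here's a compressed, hedged sketch of the kind of Spark MLlib pipeline that notebook walks through. The file path and column names are assumptions rather than the actual census schema, and the real notebook does considerably more data preparation; the flow shown is indexing text columns to numbers, assembling features, training logistic regression, and checking accuracy on held-out data.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Hypothetical path and columns; adjust to the actual dataset's schema.
df = spark.read.csv("s3://my-bucket/census/adult.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

stages = [
    StringIndexer(inputCol="workclass", outputCol="workclass_idx"),
    StringIndexer(inputCol="income", outputCol="label"),          # <=50K vs. >50K
    VectorAssembler(inputCols=["age", "hours_per_week", "workclass_idx"],
                    outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
]
model = Pipeline(stages=stages).fit(train)

# Accuracy: the share of test rows where label and prediction agree.
predictions = model.transform(test)
accuracy = predictions.filter("label = prediction").count() / predictions.count()
print(f"accuracy: {accuracy:.2%}")
```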
Data Warehousing with Amazon Redshift
Overview, Data Warehouse Concepts, MPP Diagrammed

Hi, I'm Andrew Brust, and this module is Data Warehousing with Amazon Redshift. Redshift is Amazon's cloud-based data warehouse. It's a very important service in the AWS data world, and in this module, we'll learn about its lineage, its capabilities, its integration with other Amazon services, and a little bit about its architecture as well. We'll begin though with a discussion of some underlying data warehouse concepts so that we can understand Redshift in full context. So what is a data warehouse? Well, first of all, it's important to know that data warehouse technology has been around for quite some time. So in relative terms for technology, it's already a traditional repository. It's optimized specifically for reading data and doing analysis on that data, and that's interesting because data warehouses are based on the same relational technology that was initially designed for operational workloads. Most operational workloads primarily involve writing data, whether that's creating new rows of data or updating existing rows. And where reading is concerned, typically operational databases look at single rows or small batches of rows. Data warehouses, on the other hand, scan large volumes of rows in order to perform aggregations that are required for analysis. So how can a technology originally designed for operational workloads be optimized for analytical ones? There are several technologies involved. Let's look at a few of them right now. First of all, we have this construct of columnar storage. In columnar storage, all the values for a given column are stored contiguously, and that makes that scanning and aggregating that we were just talking about a lot more efficient. Columnar storage also makes it possible to get high rates of compression on the data, and that in turn allows data warehouses to load a great deal of that data into memory. When data's in memory, analysis on it is even faster. And one more supporting technology here is vector processing. Those are specific operations in a microprocessor that allow handling multiple data values at once instead of one at a time. The combination of these technologies, which made its way into most data warehouse products over the last decade, really revamped that market quite impressively. On the other hand, many vendors in the ecosystem based their business model largely on charging a premium for storage in the warehouse and also used hardware appliances where the amount of storage that could fit was finite. That made for a good revenue story, but it also caused customers to be very frugal and sparing about how much data they put in the warehouse. And this ended up being a drag on the market. What's changed recently though is the idea of putting data warehouses in the cloud, using the cloud's elastic scaling capabilities, and in some cases, using the economics of cloud storage to eliminate some of those blockers and factors discouraging customers from using their data warehouses liberally. One more important technology underlying virtually all data warehouses is something called massively parallel processing, or MPP. The gist of an MPP data warehouse architecture is that even though the warehouse looks like a single database server instance, in fact it contains a whole cluster of instances and one additional instance that sits in front of the rest. That, combined with an array of shared storage, makes for very efficient operations. Specifically, that master node is one that liaises with the user and orchestrates the work amongst all the worker nodes in the cluster.
This then creates an abstraction that makes everything look like a single database server instance. Specifically, the user submits a single query to the warehouse, and that query gets routed to the master node. The master node then takes that query and partitions it into an array of smaller queries, assigning each one to a worker node in the cluster. Those worker nodes then execute the queries assigned to them, and they do so at the same time, or in parallel. They each bring back their own result set and then send their constituent result sets back to the master node, which then stitches them all together and sends it back to the user as a single result set. So although the user interacts with the warehouse as if it were a single traditional database server, in fact multiple servers work at the same time to take on really large queries and deliver excellent performance, even over the large volumes of data involved.
Redshift Facts, Redshift, and Data Lakes

Now that we've talked about data warehousing in general, let's get more specific and talk about Redshift. Redshift is a cloud-based MPP data warehouse, and Amazon promotes it as being capable of petabyte-scale analytics. And unlike appliance-based on-premises data warehouses, which have a static sizing, Redshift clusters can be sized elastically, with new nodes being added or indeed removed as workloads require. Now a very salient feature of Redshift is that it uses conventional storage, that is, the solid-state drives on the nodes in the cluster, rather than S3 cloud object storage. That makes Redshift different from most other AWS data services and indeed makes Redshift different from several of its cloud data warehouse competitors. Redshift does however integrate with S3, as well as DynamoDB and other AWS data services. And Redshift really was a pioneer in this entire concept of having a data warehouse that was entirely cloud based. As such, it has a very big ecosystem, but it has lost ground to competitors that have come after it. Now let's discuss Redshift as a citizen in the data warehouse ecosystem. First of all, its lineage is in an on-premises data warehouse product that at one time was called ParAccel, which came from a company of the same name. Amazon was an investor in ParAccel and was thus able to adopt its technology for use in Redshift. ParAccel has since been acquired by another database company called Actian. Redshift and ParAccel, like many other MPP data warehouse products, are compatible at a query level with the open-source Postgres database. So if you have resources on your team with Postgres skillsets, they'll be able to onboard to the Redshift platform relatively quickly. Now, again, Redshift does not use object storage. It does not use S3 as its primary storage layer. And because the storage it does use is so tightly coupled to the nodes in its actual clusters, those clusters need to stay up and running 24/7. Some competitors to Redshift do use cloud storage as a primary storage layer, and therefore can offer their customers the ability for the cluster to be paused, because the data is persisted in cloud object storage, which never goes down. The result of this is that Redshift offers very good performance. It's a very fast database, because those local drives on the nodes can provide much better performance than cloud storage. On the other hand, the necessity to run those clusters 24/7 can make Redshift an expensive product compared to some of its competitors. Now, just because Redshift doesn't use S3 as its primary storage layer doesn't mean there isn't a good data lake story for Redshift. In fact, there is a good story, and it's a pretty rich one. To begin with, Redshift offers a capability called Redshift Spectrum that allows Redshift to query data files stored in S3 directly. Like many other databases on the market, Spectrum does this by modeling those files as so-called external tables. External tables look like local tables and can be queried as if they were local, but in fact the external table objects are really just pointers to those external files. This allows the data to stay where it is in the files while still allowing Redshift users to query them in the same manner as they would query physical local tables. In fact, they can write single queries that join those external tables with the local physical ones and return a single result set. The data lake story extends beyond Spectrum though, because Redshift integrates with the data catalog component in AWS Glue.
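To make that Spectrum pattern a little more concrete, here's a hedged sketch using psycopg2, which works because Redshift speaks the Postgres wire protocol. The cluster endpoint, credentials, IAM role, schema, and table names are all hypothetical; the external schema points at a Glue Data Catalog database, so its tables stay as files in S3 yet can be joined to local Redshift tables in a single query.

```python
import psycopg2

# Connect to a hypothetical Redshift cluster over its Postgres-compatible endpoint.
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="...")
conn.autocommit = True   # keep DDL outside an explicit transaction block

with conn.cursor() as cur:
    # Map a Glue Data Catalog database into Redshift as an external schema.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'sales_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
    """)

    # One query spanning an external (S3-backed) table and a local Redshift table.
    cur.execute("""
        SELECT d.region, SUM(s.amount)
        FROM lake.sales_events s                          -- external table over S3 files
        JOIN dim_region d ON d.region_id = s.region_id    -- local Redshift table
        GROUP BY d.region;
    """)
    print(cur.fetchall())
```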
And not only does Redshift integrate with Glue's data catalog, but so do the Hive, Spark SQL, and Presto services on Elastic MapReduce, as well as the Athena service on a standalone basis. That means Redshift data can coexist side by side with the same data customarily queried by these native data lake services. So whether you're working with native data warehouse workloads, data lake workloads via the Redshift Spectrum capability, or integration with other data lake assets using the Glue data catalog, Redshift can handle it all. That completes our drill-down look at AWS's cloud data warehouse, Redshift.
(Big) Data Integration and Pipelines
Overview, Glue, Data Pipeline

Hi, I'm Andrew Brust, and this module is (Big) Data Integration and Pipelines. In this module, we'll take a look at the various technologies and services on AWS that allow data engineers and others to integrate data from multiple sources. Most of these technologies and services are applicable to conventional data scenarios and not just big data scenarios. That's why we have the word big in parentheses. But our focus here is to understand the relevance of these services in a big data context. The first standalone service for us to take a look at is one we've already discussed in passing, and that's AWS Glue. AWS Glue offers a visual interface for authoring data integration pipelines. But that visual interface actually generates real code, and as it happens, that code runs on Apache Spark. But when we say Spark here, we don't mean Spark in the context of EMR. Rather, we're talking about Spark resources that are running in the background and are managed in a serverless fashion so that customers needn't be aware of the discrete Spark infrastructure involved. Now pipelines in Glue rely on the fact that the schema for both the source and the destination are already in Glue's Data Catalog, and we've discussed that Data Catalog in passing already as well. Glue provides direct access to S3, DynamoDB, Redshift, Relational Database Service including Amazon Aurora, and external databases that are accessible via the JDBC technology. The other standalone service we need to discuss in the context of data integration and pipelines is a service that itself is called AWS Data Pipeline. Now, Data Pipeline also provides a visual interface, and in fact, we'll see a demo of it at the end of this module. But this visual interface works a little bit differently than the one in Glue. The Pipeline visual interface is really a canvas for designing data flows using a box-and-line diagramming metaphor. These pipelines can be authored and run directly, or they can be scheduled, and they use a combination of EC2 instances for executing the logic and resources from the various AWS data services that serve as the sources and destinations. EC2 is involved in an approach where infrastructure from Elastic Compute Cloud is provisioned. That infrastructure serves as a runner, in the parlance of Data Pipeline, and that's actually where the pipeline executes. Now unlike Glue, there really is no code that's generated in Data Pipeline. Everything is declarative and visual. But don't let that fool you into thinking that it's an easy service to use. Data Pipeline can be very complex, and a lot of the work that's involved entails filling out complex parameters and supplying sometimes very subtle values to get the configuration of the various pipeline steps just so. This sometimes makes for difficult troubleshooting, especially in the arena of security and permissions. And Pipeline offers direct integration and visual representation in the data flows themselves of S3, Amazon Web Services logs, Elastic MapReduce, Redshift, Relational Database Service, DynamoDB, external databases via JDBC, and even Linux shell commands. And as you can see by the offering of such utilitarian resources as log files and shell commands, Pipeline can be used to script and automate all sorts of operational tasks.
Spark and Pig, Sources and Destinations, Demo Lead-in

Now that's it for standalone services, but let's not forget that various subcomponents on EMR can be used for data integration workloads as well. Specifically, Apache Spark and Apache Pig can be used for this purpose. Why is that? Because core Spark and its data frames, as well as Spark SQL, are all about manipulating data and performing data engineering tasks in general. Apache Pig is focused on data transformation. It has its own dedicated language called Pig Latin, and it can run over MapReduce, Tez, or Spark. Both of these approaches are very code oriented, and the fact remains that lots of techies prefer hand-coding their data pipelines to using visual declarative interfaces like the ones offered by Glue and Data Pipeline. If you're curious, the languages involved are Python and Scala in the Spark context, HiveQL in the Hive and Spark SQL contexts, and the aforementioned Pig Latin when working with Pig. But no matter which language you're using, this really is all about writing scripts for performing extract, transform, and load tasks, as well as data integration overall. Now let's summarize the connectivity between Pipeline, Glue, and other AWS data services. We'll start with Pipeline, which has direct integrations with RDS, Redshift, and DynamoDB, and also offers integration with S3, as does virtually every single data service on the Amazon Cloud, as well as Elastic MapReduce, shell commands, and JDBC. If we take this and compare it to the connectivity on the Glue side, it's almost identical, but we lose that Linux shell capability, and we also lose that direct connectivity to EMR. However, by having connectivity to S3, we are in fact connecting to the storage layer that EMR uses in its primary scenarios.
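For a sense of what that hand-coded approach looks like, here's a small, hedged PySpark sketch that pulls a table over JDBC, reshapes it, and lands it in S3. The connection details, table, and columns are hypothetical, and running it on EMR would also require the appropriate JDBC driver to be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: pull a (hypothetical) orders table from an external database over JDBC.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://mydb.example.com:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "etl_user").option("password", "...")
          .load())

# Transform: keep recent orders and aggregate revenue per customer per day.
daily = (orders
         .where(F.col("order_date") >= "2024-01-01")
         .groupBy("customer_id", "order_date")
         .agg(F.sum("amount").alias("revenue")))

# Load: write partitioned Parquet into S3 for downstream query engines.
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3://my-data-lake/curated/daily_revenue/")
```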
Demo: Data Pipeline

It's one thing to talk about the capabilities of these services, but let's take a closer look by having a demo of AWS Data Pipeline. Specifically, we're going to take a look at a pipeline that moves data between S3 and Redshift, and you'll see how the visual metaphor, the connectivity, and the supplying of all the parameter values works together. Here we are in an Amazon Data Pipeline designed to move data out of S3 and into Redshift. If you look at the canvas on the left, you'll see shapes representing the various entities involved. Right now, what's selected is the actual Redshift load activity shape, and it's connected to nodes for the S3 data source, the Redshift destination table, and the Elastic Compute Cloud, or EC2, resource on which the pipeline will run. Connection, database name, and login information for the Redshift table is contained in the Redshift cluster shape, and general configuration information for the pipeline is contained in one additional shape on the canvas. Now let's turn our attention to the properties and the parameter values on the right half of the screen. If I select the S3DataNode shape, you'll see that the full path for the actual data file containing the source data is contained in the Directory Path property, but that that in turn references a parameter called myInputS3Loc, for location. If we therefore drill down on parameters and look for a parameter of that same name, we see it on the bottom here. Initially, it just references a bucket called Pluralsight-or, for the Oregon datacenter where the bucket is located, and then a full file name for the source file, which we can see is in CSV format. I've opened that data in Excel, and you can see that it contains information from the Taxi and Limousine Commission in New York City about Medallion vehicles. That information includes license numbers, the owner's name, the model year of the vehicle, and the Medallion type. Back in the canvas, if we click on the shape for the Redshift destination table, you'll see that it orchestrates the execution of a Create Table SQL query, the contents of which is stored in a parameter called myRedshiftCreateTableSql. That too should show below, and indeed it does, although it's a little bit hard to read in a one-line display. But if we copy and paste that SQL query into a text editor, you can see that we are creating a table in Redshift with columns that correspond to those that we saw in the CSV file. Back in the pipeline and looking at other parameter values, you can see that we'll truncate or drop the table if it already exists and that we're specifying that the data source is in CSV format and to ignore the very first row, as it contains column name information. As you can see, the paradigm in Data Pipeline involves a lot of formality and a lot of hierarchy. Here we are doing something relatively simple, and yet we have several entities in the pipeline with configuration parameters spread amongst them. Once you're used to the rubric of working in Data Pipeline, these details can get easier. But there are a lot of niceties involved, and getting everything to execute on the first try can be a bit of a challenge.
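As a postscript to that demo, the heavy lifting in an S3-to-Redshift pipeline like this one ultimately comes down to Redshift's COPY command; here's a hedged, hand-run sketch of that step using psycopg2. The cluster endpoint, credentials, table, bucket, file, and IAM role are all hypothetical, and the target table is assumed to already exist.

```python
import psycopg2

# Load a CSV file from S3 into an existing Redshift table with COPY.
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="awsuser", password="...")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("""
        COPY medallion_vehicles
        FROM 's3://my-bucket/medallions.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
        CSV IGNOREHEADER 1;          -- the first row holds column names
    """)
```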
Visualizing Your Big Data with QuickSight
Overview, QuickSight Facts, Capabilities, Sharing, Demo Lead-in

I'm Andrew Brust, and this module is Visualizing your Big Data with QuickSight. This is a short module in which we'll cover an ancillary service which, while not critical for doing big data work, can nonetheless be very complementary and allow you to convey the value of all of that lower-level work with external constituencies. Simply put, QuickSight is Amazon's first-party business intelligence, or BI, tool. Now there are a lot of BI tools on the market, from the cloud providers, from the enterprise software megavendors, and from a number of specialized vendors as well. And you may find that QuickSight is not the most competitive among them. If you're coming to your AWS data work with a BI tool of choice already, it's unlikely that QuickSight will displace it. On the other hand, if BI tools are not your passion and your needs are relatively simple, QuickSight might be just what you need to give a visual element to all of the important rigorous work you'll be doing with the other services discussed in this course. Now QuickSight offers connectivity to a wide stable of AWS data services, including Redshift, Relational Database Service including Aurora, S3, various external databases, and it even provides a native connector for Salesforce, which is a software-as-a-service-based customer relationship management, or CRM, platform. Many self-service BI tools work on the principle of having their own native engine into which data is imported, and QuickSight's no exception there. It has its own engine called the Super-fast, Parallel, In-memory Calculation Engine, or SPICE for short. And data can be imported into SPICE for high-performance calculation and aggregation. On the other hand, QuickSight can also query those data sources directly, leaving the data where it is. While the performance may be a little lower in those cases, it avoids data duplication and, in many scenarios, would be preferred. Capabilities-wise, let's start with two really important ones, which is that QuickSight works in a fashion where all you really need to do is select the columns that you want to visualize. QuickSight uses what it calls AutoGraph mode by default, where it will pick what it determines is the best visualization type and configuration based on the data that you've selected. But going beyond that default visualization, there are actually over 20 visualization types inside the product. They are fully configurable and formattable. So even though the AutoGraph mode, where so much is automated, can be convenient, that does not mean that a great deal of customization is out of reach. In fact, advanced users who want to work manually or heavily customize what's automatically generated can do so very productively in QuickSight. And in addition to the core visualization capabilities, QuickSight also offers basic data preparation capabilities, and for advanced users, the ability to specify a customized SQL query to bring back the data in exactly the shape and configuration necessary to get the best visualizations built and shared out. And speaking of sharing, the assets that are built inside of QuickSight can be developed by advanced users and shared with more basic users downstream. That includes the data sources themselves. This would be at the lowest level of sharing, where the data connection information and the actual data retrieved is shared, and the downstream users will do everything from there.
On the other hand, analyses and dashboards can be shared and still provide a great deal of interactivity for analytically curious users. For perhaps the more executive users who really are interested in a consumption-only experience, QuickSight also offers so-called stories, which are slideshow-like presentations of a collection of visualizations. These stories serve more as documents that consumption users can review directly without having to do additional work.
Demo: QuickSight

Now QuickSight is a visual product, so it makes sense to have a quick demo where we can look at visualizations inside of a QuickSight dashboard that is connected to data in Redshift and S3. Now as we've discussed in this module, QuickSight provides entry-level business intelligence, or BI, capabilities right on the AWS Cloud. And QuickSight has the ability, of course, to connect to a number of native Amazon first-party data services and bring data in from them for coordinated analysis within QuickSight itself. As an able demonstration of that, we have two datasets defined here. One points to Redshift to a dataset containing information from the New York City Taxi and Limousine Commission on Medallion vehicles. The other contains Florida insurance information from a CSV file stored in an S3 bucket. If we move first into a QuickSight analysis, we can see a bar chart on the left of data from the New York City Taxi and Limousine Commission. Specifically, we're looking at a count of vehicles broken down by model year and Medallion type. To the right, we're seeing our Florida insurance data plotted in a map visualization using the latitude and longitude data in the dataset itself. There's lots we can do here, but let's take a look at the data story facility, and if we click on the first scene in the story, we'll see the Florida insurance data, first looking at a close-up of the map for data in Miami. If we forward to the next scene, we'll see corresponding information for Key West. And if we go to the final scene, we'll see that same information for Orlando. If we stop the story, we can return to the standard analysis view, and if we click on Visualize, we're back to the menu of visualization types that we had at the beginning. While this analysis itself is shareable, we can also share a dashboard view of the analysis that still shows us the same data and the same visualizations. It still gives us interactivity, for example, the ability to set different filter values, but it ensures the wider audience that has access to this dashboard view can't modify or edit the analysis in a significant way. Now, despite the relative entry-level capabilities of QuickSight, it is still very convenient to have a first-party service that allows us to create useful visualizations of our data without leaving the AWS Cloud or the world of first-party AWS services.
Strategy
Overview, Major Services Recap, Capabilities Matrix

My name's Andrew Brust, and this module is Strategy. This is our concluding module, in which we zoom up to 10,000 feet, summarize all the services that we've looked at and the workloads that map to them, and then share with you a matrix that provides a full inventory of those services so you can best understand which combination of services will be most relevant to you and your requirements. Now we've drilled down into so many services that it makes sense to review which are the headline services and what workloads they cover. Starting from the top, Redshift is Amazon's data warehouse service. S3 and Athena together constitute Amazon's data lake stack in minimalist fashion. Relational databases can be found in RDS, the Relational Database Service. NoSQL workloads are handled by the banner service, DynamoDB, as well as other NoSQL services that we covered earlier in the course. Heading back to the data lake, batch analytics is best handled by Elastic MapReduce and its numerous subcomponents. And streaming data analytics are addressed by Amazon Kinesis and MSK. Now here's a table providing an inventory of the various services that we've looked at on the Amazon Cloud, as well as a couple of third-party services at the bottom, Databricks and Snowflake, which run on the Amazon Cloud. Notice literally all of these services have some interplay with data lake scenarios. Many of them are based on open-source software or have open-source software components within them. Redshift and the third-party Snowflake service are data warehouse services. A number of the services here address data engineering workloads. As we've already pointed out, Databricks and Snowflake are third-party services. And we've also specified which subset of services handles streaming data workloads and machine learning requirements. It may be very useful for you to pause the video now and study this table carefully or to come back to it as a reference asset. I hope this approach of starting with high-level concepts, then drilling down on individual services, and then concluding with a summary and an inventory of services and capabilities has been useful. I also hope this course has piqued your interest to the point where you'd like to learn more about some of these services or have members on your team do likewise. For Pluralsight, this is Andrew Brust, thanking you for taking and reviewing this course and wishing you luck with your big data projects.

